1000 Genomes Project data is available at both Ensembl and the UCSC Genome Browser.
More information on accessing 1000 Genomes Project data in genome browsers can be found on the Browser page.
Ensembl provides consequence information for the variants. The variants that are loaded into the Ensembl database and have consequence types assigned are displayed on the Variation view. Ensembl can also offer consequence predictions using their Variant Effect Predictor (VEP).
You can see individual genotype information in the Ensembl browser by looking at the Individual Genotypes section of the page from the menu on the left hand side.
You can tell when a VCF file contains phased genotypes because the delimiter used in the GT field is a pipe symbol (|) rather than a forward slash (/), e.g.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
10 60523 rs148087467 T G 100 PASS AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28
The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.
The majority of the VCF files in official releases over the lifetime of the project contain phased variants. This is true for the pilot, phase 1 and final phase 3 data sets.
The phase 1 release files contain global R2 values, but you can also use the VCF to PLINK converter if you wish to use our files with Haploview or another similar tool.
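As a rough sketch of one way to do this conversion with the VCFtools binary (the input filename is taken from the phase 1 release and the output prefix is illustrative; this may not be the exact converter referred to above):
#Convert a bgzipped VCF to PLINK PED/MAP files
vcftools --gzvcf ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz --plink --out chr13_phase1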
The 1000 Genomes Project SNPs and short indels were all submitted to dbSNP and longer structural variants to the DGVa.
Where possible, release VCF files contain the appropriate IDs in the ID column, such as dbSNP rs IDs.
The archives contain variants discovered by the final phase of the 1000 Genomes Project (phase 3) and also by the preliminary pilot and phase 1 stages of the project. As methods were developed during the project, phase 3 represents the final data set.
No, not all the variants in the browsers produced by the 1000 Genomes Project were discovered by the 1000 Genomes Project.
The data from the 1000 Genomes Project is available in a number of browsers, including browsers produced by the 1000 Genomes Project, which reflect the major data releases associated with the pilot, phase 1 and phase 3 publications from the 1000 Genomes Project. More information on this is available on the browsers page.
The content of the 1000 Genomes Project browsers, maintained during the 1000 Genomes Project, is based on custom versions of the Ensembl browser. These databases contain the Ensembl core features (genes and transcripts), regulatory elements from the Ensembl Regulatory Build and variation data from the Ensembl Variation database.
As well as 1000 Genomes Project variation data, Ensembl variation contains data from dbSNP, ClinVar, COSMIC, dbGaP, dbVAR, EGA and many other sources.
We do not provide FASTA files annotated for 1000 Genomes variants. You can create such a file with a VCFtools Perl script called vcf-consensus.
An example set of command lines would be:
#Extract the region and individual of interest from the VCF file you want to produce the consensus from
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.vcf.gz
#Index the new VCF file so it can be used by vcf-consensus
tabix -p vcf HG00098.vcf.gz
#Run vcf-consensus
cat ref.fa | vcf-consensus HG00098.vcf.gz > HG00098.fa
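If you only want the consensus for the sliced region, you can pipe the matching reference region straight into vcf-consensus instead; a minimal sketch, assuming a GRCh37 reference FASTA indexed with samtools faidx (the reference filename is illustrative):
#Extract the reference region and apply the sample's variants to it
samtools faidx human_g1k_v37.fasta 17:1471000-1472000 | vcf-consensus HG00098.vcf.gz > HG00098.fa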
You can get more support for VCFtools on their help mailing list.
This can be done using Ensembl’s Biomart.
This YouTube video gives a tutorial on how to do it.
The basic steps are:
If you would like the coordinates on GRCh38, you should use the main Ensembl site, however if you would like the coordinates on GRCh37, you should use the dedicated GRCh37 site.
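If you prefer a scripted lookup rather than BioMart, the Ensembl REST API can also return variant coordinates on both assemblies; a minimal sketch using an rs ID taken from the examples on this page:
#GRCh38 coordinates for a variant
curl 'https://rest.ensembl.org/variation/human/rs78601809?content-type=application/json'
#GRCh37 coordinates from the dedicated GRCh37 server
curl 'https://grch37.rest.ensembl.org/variation/human/rs78601809?content-type=application/json'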
Either the Data Slicer or a combination of tabix and VCFtools allows you to subsample VCF files for a particular individual or list of individuals.
The Data Slicer, described in more detail in the documentation, has both individual and population filters. The individual filter takes the individual names from the VCF header and presents them as a list before giving you the final file. To filter by population, you must also provide a panel file that pairs individuals with populations; again, you are presented with a list to select from before being given the final file. Both lists allow multiple selections.
To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
The 1000 Genomes browser was hosted at browser.1000genomes.org, but all data is also accessible from the Ensembl browser at grch37.ensembl.org. You can see individual genotype information in the browser by looking at the Sample Genotypes section of a variant page, which can be reached from the menu on the left hand side of the page. You can find a particular variant by putting its rs number in the search box visible at the top right hand corner of every browser page.
Our pilot data is all presented with respect to NCBI36 and our main project data is all presented with respect to GRCh37. If you need variant calls to be in a particular assembly it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers as this will provide a definitive mapping.
If an rs number or equivalent is not available, there are tools available from both Ensembl and the NCBI to map between NCBI36, GRCh37 and GRCh38.
The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.
Please look at the software’s website to find those files.
All the pilot data remains on our ftp site under the pilot_data directory EBI/NCBI. The variants which are discussed in the pilot paper can also be found on the ftp site EBI/NCBI.
Please note these data are all mapped to the NCBI36 human reference.
The 1000 Genomes Project shares some samples with the HapMap project; any sample which starts with NA was likely part of the HapMap project. In the pilot stages of the project, HapMap genotypes were also used to help quality control the data and identify sample swaps and contamination. Since phase 1, the HapMap data has not been used by the 1000 Genomes Project, and all genotypes were independently identified by 1000 Genomes.
Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi-allelic variants, each alternative allele frequency is presented in a comma-separated list.
An example INFO column which contains this information looks like:
1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
If you want population specific allele frequencies you have three options:
* For a single variant, you can look at the population genetics page for that variant in our browser. This gives you pie charts and a table for a single site.
* For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
* If you would like sub-population allele frequencies for a whole file, you are best to use the vcftools command line tools.
This is done using a combination of two vcftools commands, vcf-subset and fill-an-ac.
An example command set using files from our phase 1 release would look like:
grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list
vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac | bgzip -c > CEU.chr13.phase1.vcf.gz
Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).
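For example, a record with AC=3050 and AN=5008 (as in the global INFO line shown earlier) corresponds to an allele frequency of 3050/5008 ≈ 0.609, matching the quoted AF.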
Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.
There are two ways to get a subset of a VCF file.
The first is to use the Data Slicer tool from our browser, which is documented here. This tool gives you a web interface requesting the URL of any VCF file and the genomic location you wish to get a sub-slice for. It also works for BAM files, and it allows you to filter the file for particular individuals or populations if you also provide a panel file.
The second method is to use tabix on the command line, e.g.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
Specifications for the VCF format, and a C++ and Perl tool set for VCF files, can be found at vcftools on sourceforge.
Please note that all our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than the UCSC-style chr1. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
Our released genotypes are created not just using sequence evidence but also imputation. If an individual has no coverage at a particular location but overall we have been able to determine there is variation at that location then we can statistically infer the genotype for that variant in that individual using haplotype information. This means we are able to provide complete haplotypes for all the variation we discover. For more information about how the genotype calling was done please refer to our phase 1 publication.
All the variants in both our VCF files and on the browser are always reported on the forward strand.
The project has two releases of structural variation. The pilot paper data directory contains VCF files for deletions, mobile element insertions, tandem duplications and novel sequence, for both the low coverage and trio pilot studies. Our phase 1 integrated release contains deletions together with the SNPs and short indels.
The VCF files on our site span a number of format versions, but our most recent release VCF files are in format version 4.1.
The longer structural variants predicted as part of the pilot project were submitted to the DGVa and given the accession estd59.
Chromosome X, Y and MT variants are available for the phase 3 variant set. The chrX calls were made in the same manner as the autosomal variant calls and as such are part of the integrated call set, which includes SNPs, indels and large deletions; note that males in this call set are shown as haploid for any chrX SNP not in the pseudoautosomal region (PAR). The chrY and chrMT calls were made independently. Both call sets are presented in integrated files in the phase 3 FTP directory, for chrY and chrMT. ChrY has SNPs, indels and large deletions; chrMT only has SNPs and indels. For more details about how these call sets were made please see the phase 3 paper.
Our variant files are released via our release directory in directories named for the sequence index freeze they are based on.
Information about the final 1000 Genomes Project release is contained in the 20130502 directory and is described in "Where is your most recent release?".
A stable earlier release, based on 1092 unrelated samples, is the phase 1 data release; it can be found under phase1/analysis_results/integrated_call_sets. The phase 1 data set contains information on all autosomes and chrX, chrY and chrMT. The phase 1 publication is based on this data set.
The pilot release represents results obtained in the three pilot studies of the project (low coverage, high coverage trios and exome). The release data can be found here. The publication about the findings of the pilot studies is in this PDF.
You may also find variant files in our technical/working directory, but please be aware these are experimental files which represent work in progress and should always be treated with caution.
The final 1000 Genomes phase 3 analysis calculated consequences based on GENCODE annotation and this can be found in the directory: release/20130502/supporting/functional_annotation/
Ensembl also provides consequence information for the variants. The variants that are loaded into the Ensembl database and have consequence types assigned are displayed on the Variation view. Ensembl can also offer consequence predictions using its Variant Effect Predictor (VEP).
Please note that the phase 3 annotations and the Ensembl annotations visible via the browser may differ, as they use different versions of the gene and non-coding annotation.
The ancestral alleles associated with the phase 1 release were generated using two different processes.
The SNP ancestral alleles were derived from Ensembl Compara release 59. The alignments used to generate them can be found in the phase1/supporting directory.
The indel ancestral alleles were generated using a separate process.
The deletions should not have any ancestral alleles.
As the majority of sites in the genome have only been sequenced to low coverage, some genotypes in every individual will be based on imputation.
The process used to create our genotypes first gave our merged sites and genotype likelihoods to Beagle to generate initial haplotypes (using 50 iterations across all samples); these were then refined using a modified version of Thunder (which used 300 states chosen by longest matching haplotype at each iteration, in addition to 100 randomly chosen states).
This process means we are unable to precisely identify which sites used imputation to generate their genotype. Without this process the approximate error rate for our heterozygous sites would be 20%, so you can estimate that 20% of our heterozygous sites will have been changed on the basis of imputation. The sites covered by our exome sequencing represent our highest accuracy sites and these are the least likely to have been changed by this process. The converse is also true: any site without any sequence alignment will have been imputed. You can find the depth of coverage at any site using our BAM files. Other sites may have been given greater evidence on the basis of the imputation and refinement process.
You can find out more about this in our Phase 1 paper.
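As a rough sketch of checking coverage at a site from an alignment file (the BAM filename and region here are placeholders; use the BAM for the sample you are interested in):
#Report per-base read depth across a small region for one sample's alignments
samtools depth -r 17:1471000-1472000 HG00096.mapped.bam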
In some early main project releases the allele frequency (AF) was estimated using additional information such as LD, mapping quality and haplotype information. This means in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release, AF should always match AC/AN rounded to 2 decimal places.
The pilot data for the 1000 Genomes Project was all mapped to the NCBI36/hg18 build of the human assembly. When the data was loaded into dbSNP it was mapped to GRCh37/hg19, which is accessible from both Ensembl and UCSC, but this does mean that the coordinates from the pilot data on the 1000 Genomes FTP site will be different to the coordinates presented in Ensembl and UCSC.
You can also view 1000 Genomes variants mapped to GRCh38 on Ensembl and UCSC.
The phase 1 variants list released in 2012 and the phase 3 variants list released in 2014 overlap, but phase 3 is not a complete superset of phase 1. Comparing the two releases by variant position shows that 2.3M phase 1 sites are not present in phase 3. Of these 2.3M sites, 1.92M are SNPs; the rest are either indels or structural variants (SVs).
The difference between the two lists can be explained by a number of different reasons.
1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to that sample will not be part of the phase 3 set.
2. Our input sequence data is different. In phase 1 we had a mixture of read lengths (36bp to >100bp) and a mixture of sequencing platforms (Illumina, ABI SOLiD and 454). In phase 3 we only used data from the Illumina sequencing platform, with read lengths of 70bp or more. We believe that these calls are higher quality, and that variants excluded this way were probably not real.
3. The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.
The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, including haplotype-aware callers and callers using de novo assembly. It considered low coverage and exome sequence together rather than independently. Our genotype calling was also different, using ShapeIt2 and MVNcall, allowing integration of multi-allelic variants and complex events that weren't possible in phase 1.
891k of the 1.37M phase 1 sites missing from phase 3 were not identified by any phase 3 variant caller. These 891k SNPs have a relatively high Ts/Tv ratio (1.84), which means these were likely missed in phase 3 because they are very rare, not because they are wrong; the increase in sample number in phase 3 made it harder to detect very rare events, especially if the extra 1400 samples in phase 3 did not carry the alternative allele.
481k of these SNPs were initially called in phase 3. 340k of them failed our initial SVM filter so were not included in our final merged variant set. 57k overlapped with larger variant events so were not accurately called. 84k sites did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives but we have no strong evidence as to which of these sites are wrong and which were lost for other reasons.
4. The reference genomes used for our alignments are different. Phase 1 alignments were aligned to the standard GRCh37 primary reference including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced our false positive variant calling as it will have reduced mismapping leading to false SNP calls. We cannot quantify this effect.
We have made no attempt to elucidate why our SV and indel numbers changed. Since the release of phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume the indels and SVs in phase 1 that are missing from phase 3 are false positives in phase 1.
You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/
The phase 3 VCF files released in June 2014 contain overlapping and duplicate sites.
This is due to an error in the processing pipeline used when sets of variant calls were combined. Originally, all multi-allelic sites were separated into individual lines in the VCF file during the pipeline, but the recombination process did not always succeed, leaving us with a small number of sites with overlapping or duplicate call records. This is most commonly seen on chromosome X.
The simplest solution is to ignore duplicate sites in any analysis. If you wish to use one or both of a pair of duplicate sites in your analysis, you should use the GRCh37 alignment files to recall the genotypes at the site of interest in the relevant individuals to resolve the conflict.
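As a rough sketch of how to find the duplicated positions in one of these files (the filename is a placeholder for a phase 3 genotype VCF):
#List chromosome/position pairs that occur on more than one record
zcat ALL.chrX.phase3.genotypes.vcf.gz | grep -v '^#' | cut -f1,2 | sort | uniq -d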
There are a small number of variants which have an Allele Count of 0 and an Allele Frequency of 0.
This is because the original sample list for phase 1 had 1094 samples on it. After our integrated genotyping process, 2 samples were discovered to have very discordant genotypes.
The decision was made to leave in any variant which had only been discovered in one or both of these individuals. The analysis group is still confident in these sites but not in their genotypes. This leaves some variant sites where no sample holds the non-reference allele.
Our August 2010 call set represents a merge of various independent call sets. Not all the call sets in the merge had genotypes associated with them, and the merge was carried out using predefined rules; this has led to individuals, or whole variant sites, having no genotype, which is represented as ./. in VCF 4.0. In our November 2010 call set and all subsequent call sets, all sites have genotypes for all individuals for chr1-22 and X.
There are two main reasons a tabix fetch might fail.
All our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than the UCSC-style chr1. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
Also, tabix does not report an error when streaming from a remote file fails; it simply stops streaming. This can lead to incomplete lines, with final rows having unexpected numbers of columns, when trying to stream large sections of a file. The only way to avoid this is to download the file and work with it locally.
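For example, using the file from the command above, you could download the compressed VCF and its tabix index and then query the local copy (this assumes the .tbi index sits alongside the VCF, as it does for our release files):
#Download the compressed VCF and its index, then query the local copy
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz.tbi
tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768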
The 1000 Genomes Project submits all its variants to archives like dbSNP or the DGVa. If it hasn’t yet made it to dbSNP this means it is likely to be a new site which we haven’t yet submitted. There may also be some old sites which we subsequently discover to be false discoveries which we then suppress.
As far as our overlap with the HapMap site list goes, the majority of HapMap SNPs are found in the 1000 Genomes Project. There will be a small number of sites we fail to find using next generation sequencing, but most sites from HapMap which aren't found by the 1000 Genomes Project will be false discoveries by HapMap. There are many SNPs from the 1000 Genomes Project and other next generation sequencing projects which won't be part of HapMap, as HapMap is based on an older genotyping technology from a time when such rapid variant discovery using sequencing was not possible.